Asymmetric adversarial learning framework for multi-turn dialogue response generation

ABSTRACT

In a variety of embodiments, machine classifiers may model multi-turn dialogue as a one-to-many prediction task. The machine classifier may be trained using adversarial bootstrapping between a generator and a discriminator with multi-turn capabilities. The machine classifiers may be trained in both auto-regressive and traditional teacher-forcing modes, with the generator including a hierarchical recurrent encoder-decoder network and the discriminator including a bi-directional recurrent neural network. The discriminator input may include a mixture of ground truth labels, the teacher-forcing outputs of the generator, and/or noise data. This mixture of input data may allow for richer feedback on the autoregressive outputs of the generator. The outputs can be ranked based on the discriminator feedback and a response selected from the ranked outputs.

TECHNICAL FIELD

The present disclosure is generally related to the generation of automated responses to user input.

BACKGROUND

Computer generated responses to user input such as dialogue, images, and the like, are often limited in diversity and/or not particularly relevant to the user input. For example, computer generated responses to user input such as dialogue in conventional systems may include phrases such as “I don't know,” “I'm sorry,” and “I don't know what you are talking about,” that are safe, limited in diversity, and not particularly relevant to the topic of the conversation.

SUMMARY

The following presents a simplified summary of various aspects described herein. This summary is not an extensive overview, and is not intended to identify key or critical elements or to delineate the scope of the claims. The following summary merely presents some concepts in a simplified form as an introductory prelude to the more detailed description provided below. Corresponding apparatus, systems, and computer-readable media are also within the scope of the disclosure.

While advances in machine learning, especially within deep neural networks, have enabled new capacity for machines to learn behavior from repositories of human behavioral data, existing neural network architectures and/or methodologies continue to produce computer generated responses to user input that are limited in diversity and/or not particularly relevant to the topic of the input data. Aspects described herein may address these and other problems, and generally improve the quality and capabilities of machine classifiers trained to perform classification tasks. Systems and methods described herein may use machine classifiers to perform a variety of natural language understanding tasks including, but not limited to, multi-turn dialogue generation. Existing open domain neural dialogue models are known to produce responses that lack relevance, diversity, and in many cases coherence. These shortcomings stem from the limited ability of common training objectives to directly express these properties as well as their interplay with training datasets and model architectures.

In a variety of embodiments, machine classifiers may model multi-turn dialogue as a one-to-many prediction task. The machine classifier may be trained using adversarial bootstrapping between a generator and a discriminator with multi-turn capabilities. The machine classifiers may be trained in both auto-regressive and traditional teacher-forcing modes, with the generator including a hierarchical recurrent encoder-decoder network and the discriminator including a bi-directional recurrent neural network. The discriminator input may include a mixture of ground truth labels, the teacher-forcing outputs of the generator, and/or noise data. This mixture of input data may allow for richer feedback on the autoregressive outputs of the generator. The outputs can be ranked based on the discriminator feedback and a response selected from the ranked outputs.

These features, along with many others, are discussed in greater detail below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is described by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:

FIG. 1 shows an example of an operating environment in which one or more aspects described herein may be implemented;

FIG. 2 shows an example computing device in accordance with one or more aspects described herein;

FIGS. 3A-B show an example of a generator used in a machine classifier in accordance with one or more aspects described herein;

FIG. 3C shows an example of a machine learning classifier in accordance with one or more aspects described herein;

FIG. 4A shows a flow chart of a process for training a machine classifier according to one or more aspects of the disclosure;

FIG. 4B shows a pseudocode representation of a process for training a machine classifier according to one or more aspects of the disclosure; and

FIG. 5 shows a flow chart of a process for classifying data according to one or more aspects of the disclosure.

DETAILED DESCRIPTION

In the following description of the various embodiments, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration various embodiments in which aspects of the disclosure may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the present disclosure. Aspects of the disclosure are capable of other embodiments and of being practiced or being carried out in various ways. In addition, it is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. Rather, the phrases and terms used herein are to be given their broadest interpretation and meaning.

By way of introduction, aspects discussed herein may relate to methods and techniques for machine classifiers using adversarial learning techniques. Recent advances in deep neural network architectures have enabled tremendous success on a number of difficult machine learning problems. Deep recurrent neural networks (RNNs) in particular are achieving impressive results in a number of tasks involving the generation of sequential structured outputs, including language modeling, machine translation, image tagging, visual and language question and answering, and speech recognition. While these results are impressive, producing a neural network-based conversation model that can engage in open domain discussion remains elusive. A dialogue system needs to be able to generate meaningful and diverse responses that are simultaneously coherent with the input utterance and the overall dialogue topic. Unfortunately, earlier conversation models trained with naturalistic dialogue data suffered greatly from limited contextual information and a lack of diversity. These problems often lead to generic and safe utterances in response to a variety of input utterances.

Hierarchical Recurrent Encoder-Decoder (HRED) architectures are capable of capturing long temporal dependencies in multi-turn conversations to address the limited contextual information but the diversity problem remained. Some HRED variants such as variational and multi-resolution HREDs attempt to alleviate the diversity problem by injecting noise at the utterance level and by extracting additional context used to condition the generator. While these approaches achieve certain measures of success over the basic HRED, generated responses are still mostly generic since they do not control the generator's output. Similarly, diversity promoting training objectives can be used for single turn conversations, but cannot be trained end-to-end and therefore are unsuitable for multi-turn dialog modeling.

Machine classifiers in accordance with embodiments of the invention can be used to generate data based on a variety of input data, such as multi-turn dialog datasets. The machine classifiers can use a HRED architecture using generative adversarial networks (GAN) to compensate for exposure bias. A GAN matches data from two different distributions by introducing an adversarial game between a generator and a discriminator. A generator can, given an observable variable X and a target variable Y, generate a statistical model of the joint probability distribution on X×Y, P(X, Y). A discriminator can generate a model of the conditional probability of the target Y, given an observation x, P(Y|X=x). In dialogue generation, generated responses should be consistent with the dialogue input and overall topic. Employing conditional GANs for multi-turn dialogue models with a HRED generator and discriminator combines both generative and retrieval-based multi-turn dialogue systems to improve their individual performances. This is achieved by sharing the context and word embedding between the generator and the discriminator, thereby allowing for joint end-to-end training using back-propagation.

Machine classifiers in accordance with embodiments of the invention may employ autoregressive sampling and/or use the dense conditional probability over the vocabulary as an attention over the word embedding. This can improve the performance of machine classifiers over prior art techniques that use the categorical output of the generator decoder. Machine classifiers may further backpropagate the adversarial loss through the decoder, the encoder and finally the word embedding. This complete end-to-end backpropagation alleviates the training difficulty with autoregressive sampling of combined word- and utterance-level discrimination. Further, the utterance-level discrimination can capture nuanced semantic difference between the generated response and the ground truth that might be missed by the word level discrimination. Machine classifiers may also use negative sampling in the discriminator training to further improve the quality of the adversarial weight update provided by the discriminator to the generator.

In a variety of embodiments, machine classifiers may model multi-turn dialogue as a one-to-many prediction task. A multi-turn dialog can include one or more conversation turns indicating a user utterance and a response to that utterance. In several embodiments, a conversation turn includes a variety of other metadata, such as an identification of the user and/or the responder, as appropriate to the requirements of aspects of the disclosure. The machine classifier may be trained using adversarial bootstrapping between a generator and a discriminator with multi-turn capabilities. The machine classifiers may be trained in both auto-regressive and traditional teacher-forcing modes, with the generator including a hierarchical recurrent encoder-decoder network and the discriminator including a bi-directional recurrent neural network. The discriminator input may include a mixture of ground truth labels, the teacher-forcing outputs of the generator, and/or noise data. This mixture of input data may allow for richer feedback on the autoregressive outputs of the generator. The outputs can be ranked based on the discriminator feedback and a response selected from the ranked outputs.

Operating Environments and Computing Devices

FIG. 1 shows an operating environment 100. The operating environment 100 may include at least one client device 110, at least one task server system 130, and/or at least one classification server system 120 in communication via a network 140. It will be appreciated that the network connections shown are illustrative and any means of establishing a communications link between the computers may be used. The existence of any of various network protocols such as TCP/IP, Ethernet, FTP, HTTP and the like, and of various wireless communication technologies such as GSM, CDMA, WiFi, and LTE, is presumed, and the various computing devices described herein may be configured to communicate using any of these network protocols or technologies. Any of the devices and systems described herein may be implemented, in whole or in part, using one or more computing systems described with respect to FIG. 2.

Client devices 110 may provide data and/or interact with a variety of machine classifiers as described herein. Classification server systems 120 may store, train, and/or provide a variety of machine classifiers as described herein. Task server systems 130 may exchange data with client devices 110, provide training data to the classification server systems 120, provide input data to the classification server systems 120 for classification, and/or obtain classified data from the classification server systems 120 as described herein. However, it should be noted that any computing device in the operating environment 100 can perform any of the processes and/or store any data as described herein. The task server systems 130 and/or classification server systems 120 may be publicly accessible and/or have restricted access. Access to a particular server system may be limited to particular client devices 110. Some or all of the data described herein may be stored using one or more databases. Databases may include, but are not limited to relational databases, hierarchical databases, distributed databases, in-memory databases, flat file databases, XML databases, NoSQL databases, graph databases, and/or a combination thereof. The network 140 may include a local area network (LAN), a wide area network (WAN), a wireless telecommunications network, and/or any other communication network or combination thereof.

The data transferred to and from various computing devices in operating environment 100 may include secure and sensitive data, such as confidential documents, customer personally identifiable information, and account data. Therefore, it may be desirable to protect transmissions of such data using secure network protocols and encryption, and/or to protect the integrity of the data when stored on the various computing devices. A file-based integration scheme or a service-based integration scheme may be utilized for transmitting data between the various computing devices. Data may be transmitted using various network communication protocols. Secure data transmission protocols and/or encryption may be used in file transfers to protect the integrity of the data such as, but not limited to, File Transfer Protocol (FTP), Secure File Transfer Protocol (SFTP), and/or Pretty Good Privacy (PGP) encryption. In many embodiments, one or more web services may be implemented within the various computing devices. Web services may be accessed by authorized external devices and users to support input, extraction, and manipulation of data between the various computing devices in the operating environment 100. Web services built to support a personalized display system may be cross-domain and/or cross-platform, and may be built for enterprise use. Data may be transmitted using the Secure Sockets Layer (SSL) or Transport Layer Security (TLS) protocol to provide secure connections between the computing devices. Web services may be implemented using the WS-Security standard, providing for secure SOAP messages using XML encryption. Specialized hardware may be used to provide secure web services. Secure network appliances may include built-in features such as hardware-accelerated SSL and HTTPS, WS-Security, and/or firewalls. Such specialized hardware may be installed and configured in the operating environment 100 in front of one or more computing devices such that any external devices may communicate directly with the specialized hardware.

Turning now to FIG. 2, a conceptual illustration of a computing device 200 that may be used to perform any of the techniques as described herein is shown. The computing device 200 may include a processor 203 for controlling overall operation of the computing device 200 and its associated components, including RAM 205, ROM 207, input/output device 209, communication interface 211, and/or memory 215. A data bus may interconnect processor(s) 203, RAM 205, ROM 207, memory 215, I/O device 209, and/or communication interface 211. In some embodiments, computing device 200 may represent, be incorporated in, and/or include various devices such as a desktop computer, a computer server, a mobile device, such as a laptop computer, a tablet computer, a smart phone, any other types of mobile computing devices, and the like, and/or any other type of data processing device.

Input/output (I/O) device 209 may include a microphone, keypad, touch screen, and/or stylus through which a user of the computing device 200 may provide input, and may also include one or more of a speaker for providing audio output and a video display device for providing textual, audiovisual, and/or graphical output. Software may be stored within memory 215 to provide instructions to processor 203 allowing computing device 200 to perform various actions. Memory 215 may store software used by the computing device 200, such as an operating system 217, application programs 219, and/or an associated internal database 221. The various hardware memory units in memory 215 may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Memory 215 may include one or more physical persistent memory devices and/or one or more non-persistent memory devices. Memory 215 may include, but is not limited to, random access memory (RAM) 205, read only memory (ROM) 207, electronically erasable programmable read only memory (EEPROM), flash memory or other memory technology, optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store the desired information and that may be accessed by processor 203.

Communication interface 211 may include one or more transceivers, digital signal processors, and/or additional circuitry and software for communicating via any network, wired or wireless, using any protocol as described herein. It will be appreciated that the network connections shown are illustrative and any means of establishing a communications link between the computers may be used. The existence of any of various network protocols such as TCP/IP, Ethernet, FTP, HTTP and the like, and of various wireless communication technologies such as GSM, CDMA, WiFi, and LTE, is presumed, and the various computing devices described herein may be configured to communicate using any of these network protocols or technologies.

Processor 203 may include a single central processing unit (CPU), which may be a single-core or multi-core processor, or may include multiple CPUs. Processor(s) 203 and associated components may allow the computing device 200 to execute a series of computer-readable instructions to perform some or all of the processes described herein. Although not shown in FIG. 2, various elements within memory 215 or other components in computing device 200, may include one or more caches including, but not limited to, CPU caches used by the processor 203, page caches used by the operating system 217, disk caches of a hard drive, and/or database caches used to cache content from database 221. For embodiments including a CPU cache, the CPU cache may be used by one or more processors 203 to reduce memory latency and access time. A processor 203 may retrieve data from or write data to the CPU cache rather than reading/writing to memory 215, which may improve the speed of these operations. In some examples, a database cache may be created in which certain data from a database 221 is cached in a separate smaller database in a memory separate from the database, such as in RAM 205 or on a separate computing device. For instance, in a multi-tiered application, a database cache on an application server may reduce data retrieval and data manipulation time by not needing to communicate over a network with a back-end database server. These types of caches and others may be included in various embodiments, and may provide potential advantages in certain implementations of devices, systems, and methods described herein, such as faster response times and less dependence on network conditions when transmitting and receiving data.

Although various components of computing device 200 are described separately, functionality of the various components may be combined and/or performed by a single component and/or multiple computing devices in communication without departing from the invention.

Any data described and/or transmitted herein may include secure and sensitive data, such as confidential documents, customer personally identifiable information, and account data. Therefore, it may be desirable to protect transmissions of such data using secure network protocols and encryption, and/or to protect the integrity of the data when stored on the various computing devices. For example, a file-based integration scheme or a service-based integration scheme may be utilized for transmitting data between the various computing devices. Data may be transmitted using various network communication protocols. Secure data transmission protocols and/or encryption may be used in file transfers to protect the integrity of the data, for example, File Transfer Protocol (FTP), Secure File Transfer Protocol (SFTP), and/or Pretty Good Privacy (PGP) encryption. In many embodiments, one or more web services may be implemented within the various computing devices. Web services may be accessed by authorized external devices and users to support input, extraction, and manipulation of data between the various computing devices in the system 200. Web services built to support a personalized display system may be cross-domain and/or cross-platform, and may be built for enterprise use. Data may be transmitted using the Secure Sockets Layer (SSL) or Transport Layer Security (TLS) protocol to provide secure connections between the computing devices. Web services may be implemented using the WS-Security standard, providing for secure SOAP messages using XML encryption. Specialized hardware may be used to provide secure web services. For example, secure network appliances may include built-in features such as hardware-accelerated SSL and HTTPS, WS-Security, and/or firewalls. Such specialized hardware may be installed and configured in the system 200 in front of one or more computing devices such that any external devices may communicate directly with the specialized hardware.

Machine Classifiers and Processes

Machine classifiers can make predictions for responses to a multi-turn dialog task based on previous turns in the dialog. The generator of the machine classifier can include three structures: encoder (eRNN), context (cRNN), and decoder (dRNN) neural networks. In several embodiments, these structures are based on a recurrent neural network (RNN) architecture, although any architecture can be used as appropriate. A generator can make predictions conditioned on a dialogue history $h_i$, attention $A_i^j$, noise sample $Z_i^j$, and ground truth $X_{i+1}^{j-1}$. The machine classifier can also include a discriminator, which may include an RNN architecture. The discriminator, conditioned on $h_i$, distinguishes between the generated output $\{Y_i^j\}_{j=1}^{M_{i+1}}$ and ground truth $\{X_{i+1}^j\}_{j=1}^{M_{i+1}}$. The discriminator can discriminate bidirectionally at the word level. The machine classifier can also include an attention RNN. The attention RNN ensures local relevance while the context RNN ensures global relevance. Their states are combined to initialize the decoder RNN and the discriminator BiRNN.

A machine classifier being trained in an asymmetric mode may minimize the discriminator loss of the generator G based on the autoregressive outputs of the generator while the discriminator tries to maximize the discriminator loss based on the teacher forcing outputs of the generator. During training, the generator parameters may be simultaneously trained using the teacher forcing maximum likelihood estimator (MLE) and with the adversarial loss from the samples generated autoregressively. The teacher forcing MLE may set a lower bound performance with single step look ahead for the generator while the autoregressive adversarial component allows for a multi-step look ahead. The generator parameters may be shared for both teacher forcing MLE and autoregressive adversarial updates.

FIGS. 3A-B show an example of a machine classifier in accordance with one or more aspects described herein. The generator portion of machine classifier 300 includes four RNNs with different parameters: encoder RNN (eRNN) 310, 320, context RNN (cRNN) 312, 322, attention RNN (aRNN) 314, 324, and decoder RNN (dRNN) 316, 326. In several embodiments, aRNN 314, 324 and eRNN 310, 320 are bidirectional while cRNN 312, 322 and dRNN 316, 326 are unidirectional. Each RNN can have three layers and a hidden state size of 512. However, it should be noted that any number of layers and any hidden state size can be used as appropriate. dRNN and aRNN can be connected using an additive attention mechanism. The conditional probability $P_{\theta_G}$ modeled by the generator G per output word token can be expressed as:

$P_{\theta_G}(Y_i^j \mid X_{i+1}^{1:j-1}, X_i) = \mathrm{dRNN}(E(X_{i+1}^{j-1}), h_i^{j-1}, h_i)$

where $\theta_G$ is the parameters of G, $E(\cdot)$ is the embedding lookup, $h_i = \mathrm{cRNN}(\mathrm{eRNN}(E(X_i)), h_{i-1})$, $\mathrm{eRNN}(\cdot)$ maps a sequence of input symbols into a fixed-length vector, and $h_i^j$ and $h_i$ are the hidden states of dRNN and cRNN, respectively. The conditional probability, accounting for noise samples $Z_i^j$, can be expressed as

$P_{\theta_G}(Y_i^j \mid X^{1:j-1}, Z_i^j, X_i) = \mathrm{dRNN}(E(X^{j-1}), h_i^{j-1}, A_i^j, Z_i^j, h_i)$

The output 311 of eRNN 310 may be used to initialize cRNN 312. The initial state 315 of the generator may be shared with the initial state 316 of the discriminator. In several embodiments, the discriminator shares context and/or word embeddings with the generator. The generator output 317 may be provided to and used by the discriminator to generate discriminator output 318. The output 321 of eRNN 320 may be used to initialize cRNN 322. The initial state 325 of the generator may be shared with the initial state 326 of the discriminator. In several embodiments, the discriminator shares context and/or word embeddings with the generator. The generator output 327 may be provided to and used by the discriminator to generate discriminator output 328.

The generator may be conditioned on the dialog history $h_i$, attention $A_i^j$, noise sample $Z_i^j$, and one or both of the generated output $\{Y_i^j\}_{j=1}^{M_{i+1}}$ and ground truth $\{X_{i+1}^j\}_{j=1}^{M_{i+1}}$. The discriminator may be conditioned on $h_i$ and may distinguish between the generated output $\{Y_i^j\}_{j=1}^{M_{i+1}}$ and the ground truth $\{X_{i+1}^j\}_{j=1}^{M_{i+1}}$.

A variety of existing machine classifiers process high-level tokens by extracting and processing the tokens by another RNN. Machine learning classifiers in accordance with embodiments of the invention can circumvent the need for this extra processing by allowing dRNN to attend to different parts of the input utterance during response generation. Local attention can be used and the attention memory can be encoded differently from the context through aRNN, yielding:

$P_{\theta_G}(Y_i^j \mid X_{i+1}^{1:j-1}, X_i) = \mathrm{dRNN}(E(X_{i+1}^{j-1}), h_i^{j-1}, A_i^j, h_i)$

where

$A_i^j = \sum_{m=1}^{M_i} \frac{\exp(\alpha_m)}{\sum_{k=1}^{M_i} \exp(\alpha_k)} h_i^{\prime m}$

$h_i^{\prime m} = \mathrm{aRNN}(E(X_i^m), h_i^{\prime m-1})$

$h'$ is the hidden state of aRNN 314, and $\alpha_m$ is a logit projection of, depending on implementation, either of:

$(h_i^{j-1}, h_i^{\prime m})$

$(h_i^{j-1})^T \cdot h_i^{\prime m}$

For teacher forcing mode, $X = X_{i+1}$, while $X = Y_i$ for the autoregression.
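The attention term $A_i^j$ is a softmax-weighted sum of the aRNN states. The following is a minimal sketch in plain Python, not the disclosed implementation: it assumes the dot-product form of the logit, and the function names `softmax`, `dot`, and `attention_context` as well as the toy vectors are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_context(decoder_state, memory):
    """Return (weights, A), where A is the weighted sum of the memory states.

    decoder_state stands in for h_i^{j-1}; memory[m] stands in for h'_i^m.
    The logit for each memory slot uses the dot-product option (assumption).
    """
    logits = [dot(decoder_state, h_m) for h_m in memory]
    weights = softmax(logits)
    dim = len(memory[0])
    context = [sum(w * h_m[d] for w, h_m in zip(weights, memory))
               for d in range(dim)]
    return weights, context

# Toy example: three attention states of dimension 2.
weights, context = attention_context([1.0, 0.0],
                                     [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Memory states aligned with the decoder state receive larger weights, so the context vector leans toward them.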

Noise, such as Gaussian noise, can be injected at the input of dRNN. Noise samples can be injected at the utterance or word level. With noise injection, the conditional probability of the decoder output becomes

$P_{\theta_G}(Y_i^j \mid X_{i+1}^{1:j-1}, Z_i^j, X_i) = \mathrm{dRNN}(E(X_{i+1}^{j-1}), h_i^{j-1}, A_i^j, Z_i^j, h_i)$

where utterance-level noise and word-level noise are, respectively:

$Z_i^j \sim \mathcal{N}_i(0, I)$

$Z_i^j \sim \mathcal{N}_i^j(0, I)$
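One reading of the two noise levels is that utterance-level noise is drawn once per utterance and reused at every decoder step, while word-level noise is redrawn at each step. The sketch below illustrates that reading only; the function `sample_noise`, its parameters, and the dimensions are hypothetical.

```python
import random

def sample_noise(num_words, dim, level, rng):
    """Return one standard Gaussian noise vector per decoder step.

    level="utterance": a single draw shared by every word of the utterance.
    level="word": an independent draw for each word.
    """
    if level == "utterance":
        shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        return [list(shared) for _ in range(num_words)]
    if level == "word":
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(num_words)]
    raise ValueError("level must be 'utterance' or 'word'")

rng = random.Random(0)
utterance_noise = sample_noise(3, 4, "utterance", rng)
word_noise = sample_noise(3, 4, "word", rng)
```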

FIG. 3C shows an example of a machine classifier in accordance with one or more aspects described herein. The machine classifier 350 includes generator 360 and discriminator 362. The generator 360 can take input data 364 and generate candidate responses, which are provided to discriminator 362. Discriminator 362 can generate a probability that each candidate response corresponds to ground truth data 366. The probability generated by discriminator 362 can be used to rank the candidate responses to generate ranked response list 368. Additionally, the probabilities generated by discriminator 362 can be provided to generator 360 to autoregressively update the parameters of generator 360.
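The ranking step can be sketched as a sort over discriminator probabilities. This is an illustration under stated assumptions: `rank_candidates` is a hypothetical helper, and the lookup table `toy_scores` stands in for a trained discriminator, with made-up probabilities.

```python
def rank_candidates(candidates, score_fn):
    """Sort candidate responses by descending discriminator probability."""
    scored = [(score_fn(text), text) for text in candidates]
    # sorted() is stable, so tied candidates keep the generator's order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical probabilities standing in for discriminator output.
toy_scores = {
    "I don't know.": 0.21,
    "The train leaves at noon.": 0.87,
    "Check the schedule online.": 0.64,
}
ranked = rank_candidates(list(toy_scores), toy_scores.get)
best_response = ranked[0][1]
```

Selecting `ranked[0]` corresponds to choosing the response the discriminator judges most likely to be ground truth.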

The discriminator 362 can share aRNN, eRNN, and cRNN with the generator 360. Discriminator 362 can also include a word discriminator $D_{WRNN}$. $D_{WRNN}$ can be a stacked bidirectional RNN with three layers and a hidden state size of 512, although any number of layers and hidden state size can be used as appropriate. The cRNN states can be used to initialize the states of $D_{WRNN}$. The output of both the forward and the backward cells for each word can be concatenated and passed to a fully connected layer with binary output. The output of the discriminator can include the probability that the word is from the ground truth given the past and future words.

For utterance-level discrimination, $D_{URNN}$ may be a unidirectional RNN with 3 layers and a hidden state size of 512, although any number of layers and hidden state size can be used as appropriate. $D_{URNN}$ may be initialized by the states of cRNN. The final output of the RNN may be passed to a fully connected layer with binary output. The output may be the probability that the input utterance is from the ground truth.

The discriminator 362 can share context and word embedding with the generator 360 and can discriminate at the word level. Word-level discrimination can be achieved through a bidirectional RNN and is able to capture both syntactic and conceptual differences between the generator output and the ground truth. The aggregate classification of an input sequence $\chi$ can be factored over word-level discrimination and expressed as

${D\left( {X_{i},\chi} \right)} = {{D\left( {h_{i},\chi} \right)} = \left\lbrack {\prod\limits_{j = 1}^{J}{D_{RNN}\left( {h_{i},{E\left( \chi^{j} \right)}} \right)}} \right\rbrack^{\frac{1}{J}}}$

where $D_{RNN}(\cdot)$ is the word discriminator RNN, $h_i$ is an encoded vector of the dialogue history $X_i$ obtained from the cRNN(·) output of generator 360, and $\chi^j$ is the jth word or token of the input sequence $\chi$. $\chi = Y_i$ and $J = T_i$ for the case of the generator's decoder output, $\chi = X_{i+1}$ and $J = M_{i+1}$ for the case of ground truth, and $M_i$ is the number of word tokens in utterance $X_i$.
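The aggregate score is the geometric mean of the per-word probabilities. A minimal sketch follows, assuming the word-level probabilities have already been produced by the word discriminator; `aggregate_word_scores` is a hypothetical name, and working in log space is a numerical choice made here rather than something stated in the disclosure.

```python
import math

def aggregate_word_scores(word_probs):
    """Geometric mean of per-word ground-truth probabilities."""
    if not word_probs:
        raise ValueError("at least one word probability is required")
    if any(p < 0.0 or p > 1.0 for p in word_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    if any(p == 0.0 for p in word_probs):
        return 0.0  # a single zero factor zeroes the whole product
    # Average the logs, then exponentiate: equals (prod p_j) ** (1 / J).
    return math.exp(sum(math.log(p) for p in word_probs) / len(word_probs))

short_score = aggregate_word_scores([0.25, 1.0])
long_score = aggregate_word_scores([0.9, 0.9, 0.9])
```

The 1/J exponent keeps scores comparable across sequences of different lengths, since a plain product would shrink as J grows.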

While the word-level discrimination performs well at returning more syntactically informative gradient, the effectiveness of the semantic information provided to the generator might be limited as it might not capture beyond word substitutes and co-occurrence. This behavior is good for overcoming the exposure bias problem. However, the discriminator may also return gradient based on the meaning of the response beyond word substitutes and co-occurrence. This can be achieved by combining an utterance-level discrimination with the word-level discrimination, yielding a multi-resolution discrimination. The combined discriminator output can be expressed as:

$D(X_i, \chi) = \lambda_D D_U(h_i, \chi) + (1 - \lambda_D) D_W(h_i, \chi)$

where $\lambda_D$ is a hyperparameter, $D_U$ is an utterance-level discrimination, and $D_W$ is a word-level discrimination. In a number of embodiments, the utterance-level discrimination may be based on a convolutional neural network.
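The multi-resolution score is a weighted sum of the two discriminator outputs. The sketch below is illustrative only: `combined_score` is a hypothetical helper, and restricting $\lambda_D$ to the interval [0, 1] is an assumption made here so that the result remains a valid probability.

```python
def combined_score(utterance_prob, word_prob, lambda_d):
    """Blend utterance-level and word-level discriminator outputs."""
    if not 0.0 <= lambda_d <= 1.0:
        # Assumed range; the disclosure only calls lambda_D a hyperparameter.
        raise ValueError("lambda_d is assumed to lie in [0, 1]")
    return lambda_d * utterance_prob + (1.0 - lambda_d) * word_prob

blended = combined_score(utterance_prob=0.8, word_prob=0.4, lambda_d=0.25)
```

Setting `lambda_d` to 0 recovers pure word-level discrimination, and setting it to 1 recovers pure utterance-level discrimination.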

The discriminator input may be a word embedding based on the generator's output or the ground truth. To allow for end-to-end backpropagation through the generator and discriminator, the input word embeddings may be shared between the generator and the discriminator, which may be expressed as:

$P_{\theta_{G}}\left( Y_{i}^{j} \mid Y_{i}^{1:j-1}, Z_{i}^{j}, X_{i} \right) = \mathrm{softmax}\left( h_{i}^{j}E \right)$

where $E \in \mathbb{R}^{V \times n}$ is the word embedding matrix, V is the vocabulary size, and n is the embedding dimension.

Machine learning classifiers can generate a ranked list of candidate responses for input data including a multi-turn dialog sequence. A multi-turn dialogue dataset can include a sequence of N utterances

-   -   X=(X₁, X₂, . . . , X_(N))

where each utterance

-   -   X_(i)=(X_(i) ¹, X_(i) ², . . . , X_(i) ^(M) ^(i) )

contains a variable-length sequence of M_(i) word tokens such that

-   -   X_(i) ^(j)∈V

for vocabulary V. At any time step i, the dialogue history can be expressed as:

-   -   X_(i)=(X₁, X₂, . . . , X_(i))

The dialogue response generation task can be defined as, given a dialogue history X_(i), generate a response

-   -   Y_(i)=(Y_(i) ¹, Y_(i) ², . . . , Y_(i) ^(T) ^(i) )

where T_(i) is the number of generated tokens.

In order to provide realistic responses, the distribution of the responses generated by the machine classifier (P(Y_(i))) should be indistinguishable from that of the ground truth P(X_(i+1)) and T_(i)=M_(i+1). Machine classifiers can condition response generation on the dialogue history and generate dialog responses that are statistically similar to the ground truth response distributions. Machine classifiers can learn a mapping from an observed dialogue history X_(i) and a sequence of random noise vectors Z_(i) to a sequence of output tokens Y_(i):

-   -   G:{X_(i),Z_(i)}→Y_(i)

The generator G can be trained to produce output sequences that are statistically similar to the ground truth sequence by an adversarially trained discriminator D that is trained to do well at detecting fake sequences. The distribution of the generator output sequence can be factored by the product rule:

${P\left( {Y_{i}❘X_{i}} \right)} = {{P\left( Y_{i}^{1} \right)}{\prod\limits_{j = 2}^{T_{i}}{P\left( {{Y_{i}^{j}❘Y_{i}^{1}},\ldots\mspace{14mu},Y_{i}^{j - 1},X_{i}} \right)}}}$

-   -   P(Y_(i) ^(j)|Y_(i) ¹, . . . , Y_(i) ^(j−1), X_(i))=P_(θ) _(G)        (Y_(i) ^(1:j−1), X_(i))

where

-   -   Y_(i) ^(1:j−1)=(Y_(i) ¹, . . . , Y_(i) ^(j−1))

and θ_(G) are the parameters of the generator model.

The generator can use an autoregressive generative model where the probability of the current token depends on the past generated sequence:

-   -   P_(θ) _(G) (Y_(i) ^(1:j−1), X_(i))

Because training the generator G in this manner can be unstable in practice, the past generated sequence may be substituted with the ground truth:

-   -   P(Y_(i) ^(j)|Y_(i) ¹, . . . , Y_(i) ^(j−1), X_(i))≈P_(θ) _(G)        (X_(i+1) ^(1:j−1), X_(i))

In a variety of embodiments, the substitution of past-generated sequences with ground truth data in the training of a machine classifier is known as training the machine classifier using a teacher forcing mode. In several embodiments, the ground truth data can be perturbed using noise data to generate a fake sample. A fake sample can be described as the teacher forcing output with some input noise Z_(i):

-   -   Y_(i) ^(j)˜P_(θ) _(G) (X_(i+1) ^(1:j−1), X_(i),Z_(i))

and the corresponding real sample as ground truth X_(i+1) ^(j).
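The difference between the autoregressive and teacher forcing modes described above lies only in which prefix conditions the next prediction. The following is a minimal sketch of that distinction; `decode` and `toy_next_token` are hypothetical names, and the toy model is a deterministic stand-in for P_(θ) _(G) rather than a trained network. Noise input is omitted for brevity.

```python
def decode(next_token, history, length, ground_truth=None):
    """Generate `length` tokens.

    If ground_truth is given, each step is conditioned on the ground-truth
    prefix (teacher forcing); otherwise each step is conditioned on the
    model's own previous outputs (autoregressive).
    """
    prefix, output = [], []
    for j in range(length):
        token = next_token(history, prefix)
        output.append(token)
        # Choose what the next step gets to see.
        prefix.append(ground_truth[j] if ground_truth is not None else token)
    return output


def toy_next_token(history, prefix):
    # Hypothetical stand-in for the generator: "last seen token plus one".
    last = prefix[-1] if prefix else history[-1]
    return last + 1


autoregressive = decode(toy_next_token, [5], 3)
teacher_forced = decode(toy_next_token, [5], 3, ground_truth=[10, 20, 30])
```

In the autoregressive run an early mistake propagates into every later step, whereas in the teacher forced run each step restarts from the ground-truth prefix; this gap between the two modes is the exposure bias discussed above.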

The objective function of the generator can be used to match the noise distribution P(Z_(i)) to the distribution of the ground truth response P(X_(i+1)|X_(i)). Varying the noise input then allows for the generation of diverse responses to the same dialogue history. Furthermore, the discriminator can be used during inference to rank the generated responses, providing a means of controlling the generator output.

In several embodiments, an asymmetric configuration may be used where the teacher forcing samples are used to train the discriminator and the autoregressive output is used for the generator's adversarial updates, which may be expressed as:

-   -   Y_(i) ^(j)˜P_(θ) _(G) (Y_(i) ^(1:j−1), X_(i),Z_(i))

Negative sampling can also be used to better train the discriminator, thereby providing better adversarial feedback to the generator. The objective of the generator can be expressed as

$\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{X_{i},X_{i+1}}\left\lbrack \log D\left( X_{i+1},X_{i} \right) \right\rbrack + \mathbb{E}_{X_{i},Z_{i}}\left\lbrack 1 - \log D\left( G\left( X_{i},Z_{i} \right),X_{i} \right) \right\rbrack$

where generator G tries to minimize this objective against an adversarial discriminator D that tries to maximize it:

$G^{*},{D^{*} = {\arg\;{\min\limits_{G}\;{\max\limits_{D}{{\mathcal{L}_{cGAN}\left( {G,D} \right)}.}}}}}$

It can be beneficial to mix the objective with a more traditional loss such as, but not limited to, cross-entropy loss and/or maximum likelihood estimation (MLE). The discriminator's job remains unchanged, but the generator is tasked not only to fool the discriminator but also to be near the ground truth X_(i+1) in the cross-entropy sense:

$\mathcal{L}_{MLE}(G) = \mathbb{E}_{X_{i},X_{i+1},Z_{i}}\left\lbrack - \log P_{\theta_{G}}\left( X_{i+1},X_{i},Z_{i} \right) \right\rbrack$

Machine classifiers using an asymmetric adversarial learning objective may separate the optimization of the generator and the discriminator. The objective function with the addition of negative samples X_(i+1) ⁻ can be expressed as:

$G^{*},D^{*} = \arg\min\limits_{G}\max\limits_{D}\left( \lambda_{G}\mathcal{L}_{G}(G,D) + \lambda_{D}\mathcal{L}_{D}(G,D) + \lambda_{M}\mathcal{L}_{MLE}(G) \right)$

where λ_(G), λ_(D), and λ_(M) are training hyperparameters. In a variety of embodiments,

$\mathcal{L}_{G}(G,D) = \mathbb{E}_{X_{i},Z_{i}}\left\lbrack 1 - \log D\left( Y_{i},X_{i} \right) \right\rbrack$

$\begin{aligned}\mathcal{L}_{D}(G,D) = {} & \mathbb{E}_{X_{i},X_{i+1}}\left\lbrack \log D\left( X_{i+1},X_{i} \right) \right\rbrack \\ & + \mathbb{E}_{X_{i},Z_{i}}\left\lbrack 1 - \log D\left( \overset{\sim}{Y}_{i},X_{i} \right) \right\rbrack \\ & + \mathbb{E}_{X_{i},Z_{i}}\left\lbrack 1 - \log D\left( X_{i+1}^{-},X_{i} \right) \right\rbrack\end{aligned}$

In many embodiments, the machine classifier is trained to generate mappings from X_(i) to Y_(i) without noise Z_(i). However, this can result in a trained machine classifier that produces deterministic outputs and fails to match any statistical similarity to the ground truth other than a delta function.

Machine classifiers may pass the ground truth and generator output to the discriminator with a label of 1 and 0 respectively. To improve the adversarial loss, a random sample of the input data (e.g. training data) X̃_(i+1) can be provided to the discriminator with a label of 0. These negative samples may improve the feedback generated by the discriminator (and provided to the generator) by making the discrimination task more difficult. In a variety of embodiments, the inclusion of negative samples in the training of the discriminator improves the accuracy of the machine classifier by approximately 10%.
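The labelling scheme described above can be sketched as a small batch-building helper. This is an illustrative sketch only: the function name, the tuple layout, and the choice to exclude the ground truth response from the pool of negative candidates are assumptions, not details given in the disclosure.

```python
import random


def build_discriminator_batch(history, ground_truth, generator_output,
                              corpus_responses, rng=None):
    """Return (history, response, label) triples for one discriminator update."""
    rng = rng or random.Random()
    batch = [
        (history, ground_truth, 1),      # real sample
        (history, generator_output, 0),  # generator (fake) sample
    ]
    # Negative sample: a response drawn at random from the training data.
    # Assumption: skip responses identical to the ground truth so that the
    # negative is not accidentally a correct answer.
    candidates = [r for r in corpus_responses if r != ground_truth]
    if candidates:
        batch.append((history, rng.choice(candidates), 0))
    return batch
```

A randomly drawn response is fluent text that is usually irrelevant to the history, so it pushes the discriminator to judge relevance rather than fluency alone.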

FIG. 4 shows a flow chart of a process for training a machine classifier according to one or more aspects of the disclosure. Some or all of the steps of process 400 may be performed using one or more computing devices as described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.

At step 410, training data can be obtained. The training data can include a number of dialog sequences in a multi-turn dialog. Each dialog sequence can include one or more word tokens. The training data can also include a current prompt to which the machine classifier is being trained to generate a response. The current prompt can include one or more word tokens indicating a statement, question, or other dialog step. The training data can also include a ground truth response to the current prompt. In several embodiments, a subset of the training data is used to train the machine classifier. The subset of training data can be randomly sampled from the training data and/or selected based on particular characteristics of certain examples in the training data. For example, if the machine classifier is being trained to identify a particular feature in input data, the examples having that particular feature may be included in the subset of training data. The training data can include teacher forcing samples and/or autoregression data. The teacher forcing samples may correspond to particular pieces of the autoregression data. For example, the autoregression data may include an output generated by the generator for a particular input and the corresponding teacher forcing sample is the ground truth label for the input. Any selected subsets can include subsets of the teacher forcing samples, the autoregression data, and/or a combination.

At step 412, the context of the machine classifier can be updated. The context of the machine classifier can indicate the current history of a multi-turn dialog, including the current prompt to which the machine classifier is generating a response. This can allow the machine classifier to generate a relevant response (e.g. a response that is statistically similar to the ground truth response) within the context of the entire dialog, not just the current prompt. The context of the machine classifier can be updated based on a subset of the dialog sequences in the training data.

At step 414, a generator output can be determined. The generator output can be determined by a generator portion of the machine classifier. The generator output can be determined based on the context and the current prompt. In a variety of embodiments, the generator output is determined based on the current model parameters for the generator, the context, and the current prompt. The generator output can include one or more word tokens forming a response to the current prompt.

At step 416, discriminator accuracy can be determined. The discriminator accuracy can be determined by a discriminator portion of a machine classifier. The discriminator accuracy can indicate the statistical similarity between the generator output and the ground truth response to the current prompt. In a variety of embodiments, the discriminator accuracy includes a probability indicating the likelihood that the generator output is the ground truth response to the current prompt based on the context.

At step 418, model parameters can be updated. The model parameters can be updated based on the context, the current prompt, the generated response, the ground truth response, and the discriminator accuracy. By updating the model parameters, the generator can be trained to generate responses that are more accurate. In several embodiments, the model parameters are updated using an autoregression weighted by the discriminator accuracy. In a variety of embodiments, the ground truth response is injected with noise as described herein. In several embodiments, it is desirable to keep the discriminator from becoming too good at discriminating the output of the generator such that the adversarial loss is not useful to the generator. At the same time, it is desirable to keep the performance of the discriminator above a threshold level such that the loss provided to the generator is useful. To balance these goals, the discriminator may only be updated when its accuracy is below a discriminator performance threshold value (e.g. acc_(D) _(th) ), such as 0.99. When the discriminator accuracy is below a generator performance threshold value (e.g. acc_(G) _(th) ), such as 0.75, the generator may be updated based on the teacher forcing MLE loss. Otherwise, the generator may be updated based on the autoregressive adversarial loss, either alone or in combination with the teacher forcing MLE loss. It should be noted that any threshold values can be used for updating the model parameters of the discriminator and/or generator as appropriate.
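The threshold-gated update rule in step 418 can be sketched as a small decision function. This sketch takes both thresholds to apply to the discriminator accuracy, as the claims do, and assumes the variant in which the adversarial loss is combined with the MLE loss; the function name and the string labels for the losses are illustrative assumptions.

```python
def select_updates(disc_accuracy, acc_d_th=0.99, acc_g_th=0.75):
    """Decide which updates to apply for one training step.

    Returns (update_discriminator, generator_losses).
    """
    # The discriminator is only trained while it is not yet "too good".
    update_discriminator = disc_accuracy < acc_d_th

    if disc_accuracy < acc_g_th:
        # Discriminator feedback is not yet reliable: use MLE only.
        generator_losses = ("mle",)
    else:
        # Assumed variant: adversarial loss combined with MLE.
        generator_losses = ("mle", "adversarial")
    return update_discriminator, generator_losses
```

With the example thresholds of 0.99 and 0.75, a discriminator accuracy of 0.8 would update the discriminator and train the generator on both losses, while an accuracy of 0.995 would freeze the discriminator for that step.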

In a variety of embodiments, the algorithm 450 shown in FIG. 4B can be used to train a machine classifier having a generator G with parameters θ_(G) and a discriminator D with parameters θ_(D). Some or all of the steps of process 450 may be performed using one or more computing devices as described herein. In a variety of embodiments, some or all of the steps of process 450 may be combined and/or divided into sub-steps as appropriate.

Once trained, a machine classifier can be used to generate responses to a prompt in a multi-turn dialog task. The generation objective can be mathematically described as

$Y_{i}^{*} = \arg\max\limits_{l}\left\{ P\left( Y_{i,l} \mid X_{i} \right) + D^{*}\left( X_{i},Y_{i,l} \right) \right\}_{l = 1}^{L}$

where

-   -   Y_(i,l)=G*(X_(i),Z_(i,l))

Z_(i,l) is the lth noise sample at dialogue step i, and L is the number of response samples.

In a variety of embodiments, the inference objective of the discriminator is the same as the training objective of the generator, combining both the MLE and adversarial criteria. This is in contrast to existing classifiers where the discriminator is usually discarded during inference. In several embodiments, an approximate solution can be calculated for the inference objective based on greedy decoding (e.g. MLE) to generate L lists of responses based on noise samples

-   -   (Z_(i,l))_(l=1) ^(L)

In order to facilitate the exploration of the generator's latent space, a modified noise distribution can be sampled,

$Z_{i,l}^{j} \sim \mathcal{N}_{i,l}(0,\alpha I)$

or

$Z_{i,l}^{j} \sim \mathcal{N}_{i,l}^{j}(0,\alpha I)$, where α>1.0

and α is the exploration factor that increases the noise variance. The L response samples can be ranked using the discriminator score

$\left\{ D^{*}\left( X_{i},Y_{i,l} \right) \right\}_{l = 1}^{L}$

The response with the highest discriminator ranking can be considered as the optimum response for the dialogue context.
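The sample-then-rank inference procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `generator` and `discriminator` are hypothetical callables standing in for the trained G* and D*, the noise is a flat list of Gaussian draws, and only the discriminator score is used for ranking.

```python
import random


def rank_responses(history, generator, discriminator,
                   num_samples=5, alpha=2.0, noise_dim=4, rng=None):
    """Sample responses from widened noise and rank them by discriminator score.

    Returns a list of (score, response) pairs, best first.
    """
    rng = rng or random.Random()
    # Covariance alpha * I corresponds to a per-component standard
    # deviation of sqrt(alpha).
    std = alpha ** 0.5
    scored = []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, std) for _ in range(noise_dim)]
        response = generator(history, noise)
        scored.append((discriminator(history, response), response))
    # Highest discriminator score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

The first element of the returned list corresponds to the response treated as the optimum for the dialogue context; keeping the full ranked list also allows a caller to inspect or fall back to lower-ranked candidates.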

FIG. 5 shows a flow chart of a process for classifying data according to one or more aspects of the disclosure. Some or all of the steps of process 500 may be performed using one or more computing devices as described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.

At step 510, input data can be obtained. The input data can include a number of dialog sequences in a multi-turn dialog. Each dialog sequence can include one or more word tokens. The input data can also include a current prompt to which the machine classifier is being asked to generate a response. The current prompt can include one or more word tokens indicating a statement, question, or other dialog step. The dialog sequences may belong to a particular class of tasks. A class of task can indicate a particular domain or other context in which a machine classifier can be trained to generate responses. For example, the input data can be the chat history for an online chat session for a customer service application, and the current prompt can indicate the last question asked by a customer. The input data can be provided to a machine classifier trained to generate responses for a particular class of task. For example, a machine classifier trained to respond to questions in a customer service context may be different from a machine classifier trained to respond to questions in a medical context.

At step 512, a candidate response list can be generated. The machine classifier can generate the candidate response list based on the dialog sequences and the current prompt. In a variety of embodiments, the machine classifier generates the response using a generator having a set of model parameters determined during the training of the machine classifier as described herein. In several embodiments, the candidate response list includes one or more candidate responses to the current prompt in the context of the dialog sequences. In a variety of embodiments, each candidate response in the candidate response list includes a confidence metric, calculated by the generator, indicating the likelihood that the candidate response corresponds to a ground truth response to the current prompt.

At step 514, a discriminator score can be calculated for each candidate response. In many embodiments, the discriminator score for a candidate response is calculated based on the dialog sequences, the current prompt, and the candidate response. The discriminator score can be determined by a discriminator of the machine classifier. The discriminator score can indicate the statistical similarity between the candidate response and an anticipated ground truth response to the current prompt. In a variety of embodiments, the discriminator score includes a probability indicating the likelihood that the candidate response is the ground truth response to the current prompt based on the context. The ground truth response to the current prompt can indicate a response that is informative, relevant, or both. The discriminator score can be calculated for each candidate response in the candidate response list. In a number of embodiments, the discriminator score is only calculated for candidate responses having a confidence metric exceeding a threshold value.

At step 516, the candidate response list can be ranked. The candidate response list can include the set of candidate responses ordered by the discriminator score for each candidate response. At step 518, a response can be selected. In many embodiments, the selected response is the candidate response having the highest (or lowest) discriminator score. That is, the selected response is the candidate response that is indicated by the machine classifier to be closest to a ground truth response for the current prompt.
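Steps 514 through 518 can be sketched as a filter, a ranking, and a selection. The function name, the `(response, confidence)` pair layout, and the use of a plain callable for the discriminator score are illustrative assumptions; the sketch selects the highest score, which is one of the two options mentioned above.

```python
def select_response(candidates, score_fn, min_confidence=0.0):
    """Filter candidates by generator confidence, rank by discriminator score.

    candidates: list of (response, confidence) pairs from the generator.
    score_fn: callable mapping a response to its discriminator score.
    Returns (selected_response_or_None, ranked_responses).
    """
    # Step 514 variant: only score candidates whose confidence metric
    # exceeds the threshold.
    eligible = [resp for resp, conf in candidates if conf > min_confidence]
    # Step 516: order by discriminator score, best first.
    ranked = sorted(eligible, key=score_fn, reverse=True)
    # Step 518: pick the top-ranked response, if any survived the filter.
    return (ranked[0] if ranked else None), ranked


# Hypothetical scores standing in for a trained discriminator.
toy_scores = {"I don't know": 0.2, "Your order ships Monday": 0.9, "Maybe": 0.5}
toy_candidates = [("I don't know", 0.9), ("Your order ships Monday", 0.6), ("Maybe", 0.1)]
best, ranking = select_response(toy_candidates, toy_scores.get, min_confidence=0.5)
```

In this toy run the generic reply has the highest generator confidence but the lowest discriminator score, so the more specific reply is selected.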

At step 520, the response can be transmitted. The response can be transmitted to any computing device displaying and/or generating the input data. For example, in the customer service context, the candidate response may be transmitted as a real-time chat response in a chat window displayed using a web browser of the customer's computer. In the medical context, the response can be transmitted via email. However, any technique for transmitting and displaying responses, such as a push notification to a mobile device, can be used as appropriate.

One or more aspects discussed herein may be embodied in computer-usable or readable data and/or computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices as described herein. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device. The modules may be written in a source code programming language that is subsequently compiled for execution, or may be written in a scripting language such as (but not limited to) HTML or XML. The computer executable instructions may be stored on a computer readable medium such as a hard disk, optical disk, removable storage media, solid-state memory, RAM, and the like. As will be appreciated by one of skill in the art, the functionality of the program modules may be combined or distributed as desired in various embodiments. In addition, the functionality may be embodied, in whole or in part, in firmware or hardware equivalents such as integrated circuits, field programmable gate arrays (FPGA), and the like. Particular data structures may be used to more effectively implement one or more aspects discussed herein, and such data structures are contemplated within the scope of computer executable instructions and computer-usable data described herein. Various aspects discussed herein may be embodied as a method, a computing device, a system, and/or a computer program product.

Although the present invention has been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. In particular, any of the various processes described above may be performed in alternative sequences and/or in parallel (on different computing devices) in order to achieve similar results in a manner that is more appropriate to the requirements of a specific application. It is therefore to be understood that the present invention may be practiced otherwise than specifically described without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.

What is claimed is:
 1. A computer-implemented method for training a generative adversarial network comprising a generator and a discriminator, comprising: obtaining training data comprising a set of conversation data; obtaining a set of teacher forcing samples; generating, by the generator and based on the training data, initial autoregression data; generating, by the generator and based on the set of teacher forcing samples, initial teacher forcing data; calculating, by the discriminator and based on the initial autoregression data and the initial teacher forcing data, a discriminator accuracy; training the discriminator when the discriminator accuracy is below a discriminator threshold value; retraining the generator using a teacher forcing loss function of the generator when the discriminator accuracy is below a generator threshold value; retraining the generator using the teacher forcing loss function and an autoregressive loss function when the discriminator accuracy is above the generator threshold value; and storing the trained generative adversarial network.
 2. The computer-implemented method of claim 1, wherein the discriminator comprises a convolutional neural network.
 3. The computer-implemented method of claim 1, wherein the discriminator comprises a recurrent neural network.
 4. The computer-implemented method of claim 1, further comprising: sampling a subset of the initial autoregression data; sampling a subset of the initial teacher forcing data corresponding to the subset of the initial autoregression data; and calculating the discriminator accuracy based on the subset of the initial autoregression data and the subset of the initial teacher forcing data.
 5. The computer-implemented method of claim 1, wherein retraining the generator comprises using a maximum likelihood estimation function.
 6. The computer-implemented method of claim 1, further comprising: sampling a random sample from the training data; and calculating the discriminator accuracy further based on the random sample.
 7. The computer-implemented method of claim 1, wherein the training data further comprises automatically generated noise data inserted into the training data.
 8. The computer-implemented method of claim 1, wherein the generator comprises a hierarchical recurrent encoder-decoder network.
 9. A device for training a generative adversarial network, comprising: a processor; and a memory in communication with the processor and storing instructions that, when executed by the processor, cause the device to: obtain training data comprising a set of teacher forcing samples; train, using the set of teacher forcing samples and based on a maximum likelihood estimation criterion, a generator of the generative adversarial network; train, using the set of teacher forcing samples and based on a maximum likelihood estimation criterion, a diversity discriminator of the generative adversarial network; generate, by the generator, initial response data based on the training data; evaluate, by the diversity discriminator, the initial response data based on a comparison to the training data; calculate, by the diversity discriminator and based on a comparison of the initial response data to the training data, a discriminator accuracy; adjust one or more parameters of the diversity discriminator when the discriminator accuracy is below a discriminator threshold value; adjust one or more parameters of the generator using a teacher forcing loss function of the generator when the discriminator accuracy is below a generator threshold value; adjust one or more parameters of the generator using the teacher forcing loss function and an autoregressive loss function when the discriminator accuracy is above the generator threshold value; and store the adjusted one or more parameters of the generator and the adjusted one or more parameters of the diversity discriminator.
 10. The device of claim 9, wherein the instructions, when read by the processor, further cause the device to: generate, by the generator, adjusted response data based on the training data; train, using the adjusted response data, an exposure bias discriminator of the generative adversarial network; evaluate, by the exposure bias discriminator, the adjusted response data based on a comparison of the adjusted response data to the training data; generate, by the exposure bias discriminator and based on the evaluation of the adjusted response data, ranked adjusted response data; and retrain, using the ranked adjusted response data, the generator.
 11. The device of claim 10, wherein the instructions, when read by the processor, further cause the device to remove the diversity discriminator from the generative adversarial network when the generator has been retrained.
 12. The device of claim 10, wherein the instructions, when read by the processor, further cause the device to remove the exposure bias discriminator from the generative adversarial network when the generator has been retrained.
 13. The device of claim 9, wherein the instructions, when read by the processor, further cause the device to: sample a random sample from the training data; and calculate the discriminator accuracy further based on the random sample.
 14. The device of claim 9, wherein the training data further comprises automatically generated noise data inserted into the training data.
 15. The device of claim 9, wherein the diversity discriminator comprises a convolutional neural network.
 16. The device of claim 9, wherein the diversity discriminator comprises a recurrent neural network.
 17. A computer-implemented method for training a generative adversarial network comprising a generator and a discriminator, comprising: obtaining training data comprising a set of conversation data; obtaining a set of teacher forcing samples; generating, by the generator and based on the training data, initial autoregression data; generating, by the generator and based on the set of teacher forcing samples, initial teacher forcing data; sampling a subset of the initial autoregression data; sampling a subset of the initial teacher forcing data corresponding to the subset of the initial autoregression data; calculating, by the discriminator and based on the subset of the initial autoregression data and the subset of the initial teacher forcing data, a discriminator accuracy; training the discriminator when the discriminator accuracy is below a discriminator threshold value; retraining the generator using a teacher forcing loss function of the generator when the discriminator accuracy is below a generator threshold value; retraining the generator using the teacher forcing loss function and an autoregressive loss function when the discriminator accuracy is above the generator threshold value; and storing the trained generative adversarial network.
 18. The computer-implemented method of claim 17, wherein retraining the generator comprises using a maximum likelihood estimation function.
 19. The computer-implemented method of claim 17, further comprising: sampling a random sample from the training data; and calculating the discriminator accuracy further based on the random sample.
 20. The computer-implemented method of claim 19, wherein the training data further comprises automatically generated noise data inserted into the training data.